In this paper, we attempt to examine whether the use of change-point analysis techniques is appropriate for detecting fatigue based on data captured from wearable sensors. As such, we perform a secondary data analysis to the data generated in: Baghdadi et al., 2018. The reader should note that their raw data was preprocessed using:
Kalman Filter: Used to process the raw data from sensors to: (i) estimate the spatial orientation of the body with respect to the global reference frame, and (ii) to estimate the kinematics of motion.
Segmentation: The motion segments where then segmented using an algorithm that assumes the existence of two peaks in the translational acceleration of the gait cycle. This assumption was justified based on the results of Tongen and Wunderlich, 2010 as well as the authors’ preliminary analyses.
The reader can show any code chunk by clicking on the code button. We chose to make the default for the code hidden since we: (a) wanted to improve the readability of this document; and (b) assumed that the readers will not be interested in reading every code chunk.
The snippet below documents the list of R libraries that were used in this research. For convenience, we used the pacman package since it allows for installing/loading the needed libraries in one step.
rm(list = ls()) # clear global environment
graphics.off() # close all graphics
library(pacman) # needs to be installed first
p_load(R.matlab, plotly, extrafont, grDevices, gridExtra,
dplyr, stringr, tidyverse, utils, reshape2,
anomalize, forecast, MVN, fractal,
ecp)
In the snippet below, we extract the 15 “.mat” files in the GitHub repository (where we loaded the data to allow for the reproduction of our work). Note that these files were originally produced in Baghdadi et al., 2018. Then, we perform several transformation steps: (a) extracting the data for the first three columns in the matlab arrays; and (b) computing three kinematic features from the data corresponding to these columns. Due to the differences between Matlab and R, this requires two nested for loops. The outer loop increments over the number of subjects, while the inner loop increments based on the different number of rows of data for each subject. Please see the comments within the code chunk for more details.
num_subjects <- seq(1, 15)
subject_heights <- c(1.71, 1.77, 1.71, 1.59, 1.69,
1.63, 1.60, 1.71, 1.67, 1.78,
1.68, 1.55, 1.83, 1.81, 1.89)
# Initilizing a df for summary data on participants
summary_df <- data.frame(matrix(nrow = 15, ncol = 9))
colnames(summary_df) <- c("Subject.Num", "num.rows",
"num.cols", "mean.scaled.stride.len",
"sd.scaled.stride.len",
"mean.scaled.stride.height",
"sd.scaled.stride.height",
"mean.stride.duration",
"sd.stride.duartion")
for (i in 1:length(num_subjects)) {
# Reading the .mat files from GitHub
raw_data <- readMat(paste0("https://github.com/fmegahed/fatigue-changepoint/blob/master/Data/Raw/Subject",num_subjects[i],".mat?raw=true"))
# Compute the number of cells, and rows in each structered matrix
raw_data_size <- lengths(raw_data) # num of cells
num_rows <- raw_data_size / 17 # all data had 17 cols
# Initilizing the six lists needed for storing the data (we keep track of the top 3 for error checking)
time_in_sec <- vector("list", length = num_rows)
position_x <- vector("list", length = num_rows)
position_y <- vector("list", length = num_rows)
stride_time <- vector("list", length = num_rows)
stride_length <- vector("list", length = num_rows)
stride_height <- vector("list", length = num_rows)
stride_duration <- vector("list", length = num_rows)
# Following for loop is needed since R reads the structured array as a nested list. The list containing the data is called "M.i.k" and it transforms/reads the original array --> rowise. This means that our first three features (with the same timestamp) are always seperated with a distance equal to the total number of rows
for (j in 1:num_rows) {
position_x[[j]] <- raw_data[["M.i.k"]][[j]]
position_y[[j]] <- raw_data[["M.i.k"]][[num_rows + j]]
stride_time[[j]] <- raw_data[["M.i.k"]][[2 * num_rows + j]]
dataholder <- raw_data[["M.i.k"]][[16 * num_rows + j]] # data holder for time
# Computing the three needed kinematic features
stride_length[[j]] <-
range(position_x[[j]])[2] - range(position_x[[j]])[1]
stride_height[[j]] <-
range(position_y[[j]])[2] - range(position_y[[j]])[1]
stride_duration[[j]] <-
range(stride_time[[j]])[2] - range(stride_time[[j]])[1]
time_in_sec[[j]] <- lapply(dataholder, mean)# using mean time of stride as a time stamp
}
# Scaling and creating one data frame per subject
assign(paste0("subject", i, "_features"),
data.frame(time.from.start = unlist(time_in_sec),
scaled.stride.len = unlist(stride_length)/subject_heights[i],
scaled.stride.height = unlist(stride_height) / subject_heights[i],
stride.duration = unlist(stride_duration)
)
)
# Creating a Summary Data Frame
df_name <- paste0("subject", i, "_features")
summary_df[i, 1] <- paste0("subject", i)
summary_df[i, 2] <- get(df_name) %>% nrow()
summary_df[i, 3] <- get(df_name) %>% ncol()
summary_df[i, 4] <- get(df_name)[, 2] %>% mean() %>% round(digits = 4)
summary_df[i, 5] <- get(df_name)[, 2] %>% sd() %>% round(digits = 4)
summary_df[i, 6] <- get(df_name)[, 3] %>% mean() %>% round(digits = 4)
summary_df[i, 7] <- get(df_name)[, 3] %>% sd() %>% round(digits = 4)
summary_df[i, 8] <- get(df_name)[, 4] %>% mean() %>% round(digits = 4)
summary_df[i, 9] <- get(df_name)[, 4] %>% sd() %>% round(digits = 4)
}
# Printing the top six rows of Subject 4's data as an example
head(subject4_features) %>% round(digits = 3)
# A Summary of the features for all 15 participants
summary_df
rm(raw_data, raw_data_size, i, j, num_rows,
dataholder, num_subjects)
save.image(file = "./Data/RGenerated/FeatureGeneration.RData")
Based on the analysis above, there are three observations to be made. First, we scaled the stride length and height based on the subject’s height. This in essence allows us to capture the steps as a percentage of the person’s height. This reduces the between subject variablity in the data and is supported by the seminal work of Oberg et al., 1993. Second, we showed the first six rows from Subject 4 to provide the reader with some insights into the sampling frequeny (after the preprocessing done in Baghdadi et al. 2018). Note that the kinematic features are computed from the sensor channels provided in their paper. Third, we saved the generated data into an R.data file, which can be accessed by clicking on: FeatureGeneration.RData. We hope that this saved data allows other researchers to reproduce and/or build on our work.
In this task, we examined several approaches for outlier detection. We originally hypothesized that these outliers constitue faulty sensor readings (since they are not sustained), and thus, we can remove this data prior to any further analysis. However, based on a detailed investigation of the data and the application of several outlier detection methods (univariate and multivariate), we have learned that the data is non-normal, highly autocorrelated and non-stationary. Thus, these approaches are not suitable for removing the outliers from our data. Thus, we then investiaged the utilization of “filtering” based methods in Section 3 to impute/correct/smoothen the data.
For the sake of reproducibility and completeness, we provide the results from our implementation of several outlier detection methods in the subsections below. Note that these methods will not be incorporated in our final analysis. They are presented here only to document how we reached our final model/approach.
First, we use a standard line plot to depict the data for each feature by person. For the sake of facilitating the visualization process, we: (a) panel the plot such that the left, center and right panels correspond to the scaled stride length, scaled stride height and step duration, respectively; and (b) we save the results of each participant (P) in a different tab.
# We are reading the data locally (you can download the file from GitHub)
load(file = "./Data/RGenerated/FeatureGeneration.RData")
for (i in 1:15) {
df_transformed <- melt(get(paste0("subject",i,"_features")),
id.vars = "time.from.start",
measure.vars=c("scaled.stride.len",
"scaled.stride.height",
"stride.duration"
)
)
assign(paste0("p",i),
ggplot(data = df_transformed,
aes(x=time.from.start,y=value,group=variable,color=variable,
shape=variable)) +
geom_line() + theme_bw() +
#ylim(0,2) +
ggtitle(paste0("Participant ",i,": Pre-filtered Data")) +
theme(legend.position="none") +
facet_wrap(~variable,nrow=3, scales = "free_y")
)
cat('###',paste0("P", i), "{-}",'\n')
print(get(paste0("p",i)))
cat('\n \n')
stat_printdf <- get(paste0("subject",i,"_features"))
cat(paste0('<source> <p> Based on the <b>raw data</b>, Participant ', i,
' has an average scaled stride length of ',
round(mean(stat_printdf[,2]),2),
', an average scaled stride height of ',
round(mean(stat_printdf[,3]),2),
' and an average stride duration of ', round(mean(stat_printdf[,4]),2),
' seconds. Based on their height of ', subject_heights[i],
' meters, their average stride length and stride height are', ' ',
round(mean(stat_printdf[,2])*subject_heights[i],2),
' and ', ' ', round(mean(stat_printdf[,3])*subject_heights[i],2),
', respectively.', '</p> </source>'
)
)
cat(' \n \n')
}
Based on the raw data, Participant 1 has an average scaled stride length of 1.07, an average scaled stride height of 0.09 and an average stride duration of 1.06 seconds. Based on their height of 1.71 meters, their average stride length and stride height are 1.82 and 0.16, respectively.
Based on the raw data, Participant 2 has an average scaled stride length of 1.63, an average scaled stride height of 0.13 and an average stride duration of 0.95 seconds. Based on their height of 1.77 meters, their average stride length and stride height are 2.88 and 0.23, respectively.
Based on the raw data, Participant 3 has an average scaled stride length of 0.94, an average scaled stride height of 0.09 and an average stride duration of 1.09 seconds. Based on their height of 1.71 meters, their average stride length and stride height are 1.61 and 0.16, respectively.
Based on the raw data, Participant 4 has an average scaled stride length of 1.28, an average scaled stride height of 0.12 and an average stride duration of 0.95 seconds. Based on their height of 1.59 meters, their average stride length and stride height are 2.03 and 0.2, respectively.
Based on the raw data, Participant 5 has an average scaled stride length of 0.96, an average scaled stride height of 0.09 and an average stride duration of 1.09 seconds. Based on their height of 1.69 meters, their average stride length and stride height are 1.63 and 0.15, respectively.
Based on the raw data, Participant 6 has an average scaled stride length of 0.91, an average scaled stride height of 0.11 and an average stride duration of 1.06 seconds. Based on their height of 1.63 meters, their average stride length and stride height are 1.48 and 0.18, respectively.
Based on the raw data, Participant 7 has an average scaled stride length of 0.84, an average scaled stride height of 0.09 and an average stride duration of 1.12 seconds. Based on their height of 1.6 meters, their average stride length and stride height are 1.34 and 0.15, respectively.
Based on the raw data, Participant 8 has an average scaled stride length of 1.53, an average scaled stride height of 0.14 and an average stride duration of 0.98 seconds. Based on their height of 1.71 meters, their average stride length and stride height are 2.62 and 0.24, respectively.
Based on the raw data, Participant 9 has an average scaled stride length of 0.87, an average scaled stride height of 0.1 and an average stride duration of 1.11 seconds. Based on their height of 1.67 meters, their average stride length and stride height are 1.46 and 0.17, respectively.
Based on the raw data, Participant 10 has an average scaled stride length of 1.28, an average scaled stride height of 0.11 and an average stride duration of 1.01 seconds. Based on their height of 1.78 meters, their average stride length and stride height are 2.27 and 0.2, respectively.
Based on the raw data, Participant 11 has an average scaled stride length of 1.2, an average scaled stride height of 0.09 and an average stride duration of 1.05 seconds. Based on their height of 1.68 meters, their average stride length and stride height are 2.02 and 0.15, respectively.
Based on the raw data, Participant 12 has an average scaled stride length of 1.61, an average scaled stride height of 0.14 and an average stride duration of 0.89 seconds. Based on their height of 1.55 meters, their average stride length and stride height are 2.5 and 0.21, respectively.
Based on the raw data, Participant 13 has an average scaled stride length of 0.67, an average scaled stride height of 0.07 and an average stride duration of 1.14 seconds. Based on their height of 1.83 meters, their average stride length and stride height are 1.23 and 0.13, respectively.
Based on the raw data, Participant 14 has an average scaled stride length of 1.32, an average scaled stride height of 0.14 and an average stride duration of 0.93 seconds. Based on their height of 1.81 meters, their average stride length and stride height are 2.39 and 0.25, respectively.
Based on the raw data, Participant 15 has an average scaled stride length of 1.18, an average scaled stride height of 0.1 and an average stride duration of 1.04 seconds. Based on their height of 1.89 meters, their average stride length and stride height are 2.24 and 0.2, respectively.
One of the most commonly used approaches for univariate outliers are defined to be the observations that lie outside \(1.5 * IQR\), where the IQR is the inter quartile range. This can be easily implemented using in base R using the boxplot.stats function. Below, we first provide the output and data plots for each person.
# We are reading the data locally (you can download the file from GitHub)
load(file = "./Data/RGenerated/FeatureGeneration.RData")
outliers_bp <- list() # initilizing a list to store outliers per participant
for (i in 1:15) {
df <- get(paste0("subject",i,"_features")) # Getting Data from Sec.
# Obtaining the Outliers for each of the three variables
out_vals_sl<- boxplot.stats(df[,2])$out
out_rows_sl <- which(df[,2] %in% out_vals_sl)
out_vals_sh<- boxplot.stats(df[,3])$out
out_rows_sh <- which(df[,3] %in% out_vals_sh)
out_vals_sd<- boxplot.stats(df[,4])$out
out_rows_sd <- which(df[,4] %in% out_vals_sd)
# Generating a union set of all obs. that have outliers
# True: if any of the 3 vars for that obs. is an outlier
outliers_total <- unique(c(out_rows_sl, out_rows_sh,
out_rows_sd))
outliers_bp[[i]] <- outliers_total # saving it to list indexed by participant number
# Remove the observations corresponding to the outliers
assign(paste0("subject",i,"_bp"),
df[-outliers_total,])
# Preparing the data for the Line Graph
df_transformed <- melt(get(paste0("subject",i,"_bp")),
id.vars = "time.from.start",
measure.vars=c("scaled.stride.len",
"scaled.stride.height",
"stride.duration"
)
) # ggplot data needs to be tall
assign(paste0("g",i),
ggplot(data = df_transformed,
aes(x=time.from.start, y=value, group=variable,
color=variable,shape=variable)) +
geom_line() + theme_bw() +
#ylim(0,2) +
ggtitle("Outliers removed via the standard boxplot method \n (any point outside of 1.5*IQR was removed)") +
theme(legend.position="none",
#axis.text.x=element_text(angle=90,hjust=1),
plot.title = element_text(hjust = 0.5)) +
facet_wrap(~variable,nrow=3, scales = 'free_y')
)
cat('####',paste0("P", i), "{-}",'\n')
# Printing the % of outliers removed
cat("\n")
num_rows <- nrow(get(paste0("subject",i,"_features")))
num_outliers <- length(outliers_total)
percent_data <- round(100*num_outliers/num_rows,2)
cat("<b>",paste0("% of removed Observations for Partcipant ",i,
": ", percent_data,"%."), "</b>",
"This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.")
cat('\n \n')
# Plotting the data without the outliers
print(get(paste0("g",i)))
cat('\n \n')
stat_printdf <- get(paste0("subject",i,"_bp"))
cat(paste0('<source> <p> After applying the <b>boxplot outlier detection methodology</b>, Participant ', i,
' has an average scaled stride length of ',
round(mean(stat_printdf[,2]),2),
', an average scaled stride height of ',
round(mean(stat_printdf[,3]),2),
' and an average stride duration of ', round(mean(stat_printdf[,4]),2),
' seconds. Based on their height of ', subject_heights[i],
' meters, their average stride length and stride height are', ' ',
round(mean(stat_printdf[,2])*subject_heights[i],2),
' and ', ' ', round(mean(stat_printdf[,3])*subject_heights[i],2),
', respectively.', '</p> </source>'
)
)
cat('\n')
cat('<source> <p> Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the <i>tsoutlier() function</i> in the <a href="https://cran.r-project.org/web/packages/forecast/forecast.pdf">forecast package</a> and the <i> iqr() function </i> in the <a href="https://cran.r-project.org/web/packages/anomalize/anomalize.pdf">anomalize package</a> use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify. </p> </source>')
cat(' \n \n')
}
% of removed Observations for Partcipant 1: 2.13%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 1 has an average scaled stride length of 1.07, an average scaled stride height of 0.09 and an average stride duration of 1.06 seconds. Based on their height of 1.71 meters, their average stride length and stride height are 1.83 and 0.16, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 2: 0.51%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 2 has an average scaled stride length of 1.63, an average scaled stride height of 0.13 and an average stride duration of 0.95 seconds. Based on their height of 1.77 meters, their average stride length and stride height are 2.89 and 0.23, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 3: 6.1%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 3 has an average scaled stride length of 0.94, an average scaled stride height of 0.09 and an average stride duration of 1.09 seconds. Based on their height of 1.71 meters, their average stride length and stride height are 1.61 and 0.15, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 4: 8.93%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 4 has an average scaled stride length of 1.32, an average scaled stride height of 0.12 and an average stride duration of 0.94 seconds. Based on their height of 1.59 meters, their average stride length and stride height are 2.1 and 0.2, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 5: 7.61%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 5 has an average scaled stride length of 0.99, an average scaled stride height of 0.09 and an average stride duration of 1.09 seconds. Based on their height of 1.69 meters, their average stride length and stride height are 1.66 and 0.15, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 6: 5.01%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 6 has an average scaled stride length of 0.91, an average scaled stride height of 0.11 and an average stride duration of 1.06 seconds. Based on their height of 1.63 meters, their average stride length and stride height are 1.49 and 0.18, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 7: 10.85%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 7 has an average scaled stride length of 0.85, an average scaled stride height of 0.09 and an average stride duration of 1.14 seconds. Based on their height of 1.6 meters, their average stride length and stride height are 1.36 and 0.15, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 8: 11.2%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 8 has an average scaled stride length of 1.6, an average scaled stride height of 0.14 and an average stride duration of 0.96 seconds. Based on their height of 1.71 meters, their average stride length and stride height are 2.74 and 0.24, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 9: 8.76%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 9 has an average scaled stride length of 0.86, an average scaled stride height of 0.1 and an average stride duration of 1.12 seconds. Based on their height of 1.67 meters, their average stride length and stride height are 1.44 and 0.17, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 10: 7.32%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 10 has an average scaled stride length of 1.33, an average scaled stride height of 0.11 and an average stride duration of 1 seconds. Based on their height of 1.78 meters, their average stride length and stride height are 2.36 and 0.19, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 11: 8.81%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 11 has an average scaled stride length of 1.22, an average scaled stride height of 0.09 and an average stride duration of 1.05 seconds. Based on their height of 1.68 meters, their average stride length and stride height are 2.06 and 0.15, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 12: 10.86%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 12 has an average scaled stride length of 1.68, an average scaled stride height of 0.14 and an average stride duration of 0.88 seconds. Based on their height of 1.55 meters, their average stride length and stride height are 2.61 and 0.21, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 13: 7.74%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 13 has an average scaled stride length of 0.68, an average scaled stride height of 0.07 and an average stride duration of 1.15 seconds. Based on their height of 1.83 meters, their average stride length and stride height are 1.25 and 0.13, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 14: 10.93%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 14 has an average scaled stride length of 1.39, an average scaled stride height of 0.14 and an average stride duration of 0.91 seconds. Based on their height of 1.81 meters, their average stride length and stride height are 2.51 and 0.25, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
% of removed Observations for Partcipant 15: 4.98%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.
After applying the boxplot outlier detection methodology, Participant 15 has an average scaled stride length of 1.2, an average scaled stride height of 0.1 and an average stride duration of 1.04 seconds. Based on their height of 1.89 meters, their average stride length and stride height are 2.27 and 0.19, respectively.
Note that we could have tried making the value of the coefficient multiplied by the IQR coefficient larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since this: (a) would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
# Saving a list of cleaned
save(subject1_bp, subject2_bp, subject3_bp, subject4_bp,
subject5_bp, subject6_bp, subject7_bp, subject8_bp,
subject9_bp, subject10_bp, subject11_bp, subject12_bp,
subject13_bp, subject14_bp, subject15_bp, outliers_bp,
file = "./Data/RGenerated/OutliersRemovedBoxplot.RData")
# We are reading the data locally (you can download the file from GitHub)
load(file = "./Data/RGenerated/FeatureGeneration.RData")
mal_results <- {}
for (i in 1:15) {
df <- get(paste0("subject",i,"_features"))
df <- df[,2:4]#only taking the four numeric fields
cat('####',paste0("P", i), "{-}",' \n')
cat('The output from the <source> <i> MVN package </i> </source> is shown below. Note that we depict the multivariate normal distribution for the three features of each distribution below. </source> <b> Essentially, the goal here is two-fold. First, we would like to examine if the joint distribution of the three kinematic features can be modeled using the multivariate normal distribution. Second, if the first hypothesis is true, then we can use the Mahalanobis Distance to identify outliers in each observation. </b> </source>')
plot.new()
mvn_test <- mvn(df, mvnTest = "dh",desc = TRUE,
univariateTest = "Lillie",
multivariateOutlierMethod = "adj",
showOutliers = TRUE) #adjusted Mahl Distance
cat(' \n')
cat(paste0('In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant ',i,' below. \n \n'))
mvn_Normality <- mvn_test$multivariateNormality
rownames(mvn_Normality) <- paste('Participant',i)
tab <- xtable::xtable(mvn_test$multivariateNormality,
align = c("c","c","c","c","c","c"))
print(tab, type= 'html',
html.table.attributes = 'align="center", rules="rows",
width=50%, frame="below"')
cat('<source> <p> \n </p> </source>')
cat(paste0('Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant ', i, ' but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.'))
cat(' \n \n')
}
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 1 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 14677.13 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 1 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 2 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 7021.46 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 2 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 3 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 2778.83 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 3 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 4 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 10446.56 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 4 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 5 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 6516.14 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 5 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 6 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 15389.84 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 6 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 7 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 3109.05 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 7 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 8 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 6399.80 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 8 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 9 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 5815.28 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 9 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 10 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 14166.64 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 10 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 11 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 15132.69 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 11 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 12 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 8902.87 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 12 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 13 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 6561.33 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 13 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 14 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 7265.08 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 14 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
The output from the In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 15 below.
| Test | E | df | p value | MVN | |
|---|---|---|---|---|---|
| 1 | Doornik-Hansen | 22184.55 | 6.00 | 0.00 | NO |
Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 15 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.
An alternative approach to detecting and then removing outliers is to use smoothing techniques to “correct” the data. Based on the results shown in the previous sections, the fundamental hypotheses behind this approach are:
The sensors’ data seem to be non-stationary, autocorrelated and somewhat complex (from a statistical perspective). More importantly, there are spikes in the data that cannot be explained by the kinematic model. These spikes produce values that are unrealistic (e.g., a scaled stride length > 1 is not possible since that means that the stride of the person is larger than their height). While this insight is useful, there is no hard limit that we can enforce since the literature reports mean values across the population instead of attempting to provide a physical threshold on what is possible.
The outliers tended to appear in pairs of two; indicating a possible issue in the segmentation of the data. Note that we label this data as outliers since it is not sustained (i.e., it is unlikely to say that a person is fatigued for only two stride).
Smoothing (i.e. a low pass filter on the data) will allow us to impute values for every stride (after an initial size window). This will allow us to preserve the successive nature of the strides, “fix” the outliers based on neighboring strides, and not throw away any observation.
An algorithmic approach based on these observations is implemented below. Note that we are using the median filter from the fractal package on CRAN. Note that in the analysis below, we are applying the median Filter, with n=29 (i.e., the median is calculated based on a overlaping moving window of size 29 strides). We chose this window size based on the recommendation of our biomechanics expert as it allows us to: (a) ignore the effect of any turns in the direction of motion; and (b) the corresponding window size is approximately 30 seconds long (which presents an easy to remember rule for practical application).
# We are reading the data locally (you can download the file from GitHub)
load(file = "./Data/RGenerated/FeatureGeneration.RData")
for (i in 1:15) {
df <- get(paste0("subject",i,"_features")) # Getting Data from Sec.
# Obtaining the Outliers for each of the three variables
time.from.start <- df[,1]
scaled.stride.len <- medianFilter(df[,2], order = 29) # moving window of size 29 obs
scaled.stride.height <- medianFilter(df[,3], order = 29) # moving window of size 29 obs
stride.duration <- medianFilter(df[,4], order = 29) # moving window of size 29 obs
# Computing the first difference to make seein the values easier
diff.time.from.start <- time.from.start[-1]
diff.scaled.stride.len <- diff(scaled.stride.len)
diff.scaled.stride.height <- diff(scaled.stride.height)
diff.stride.duration <- diff(stride.duration)
# Remove the observations corresponding to the outliers
assign(paste0("subject",i,"_medianf"),
data.frame(cbind(time.from.start, scaled.stride.len,
scaled.stride.height, stride.duration)))
assign(paste0("subject",i,"_median_diff_f"),
data.frame(cbind( diff.time.from.start, diff.scaled.stride.len,
diff.scaled.stride.height, diff.stride.duration)))
# Preparing the data for the Line Graph
df_transformed <- melt(get(paste0("subject",i,"_medianf")),
id.vars = "time.from.start",
measure.vars=c("scaled.stride.len",
"scaled.stride.height",
"stride.duration"
)
) # ggplot data needs to be tall
diff_df_transformed <- melt(get(paste0("subject",i,"_median_diff_f")),
id.vars = "diff.time.from.start",
measure.vars=c("diff.scaled.stride.len",
"diff.scaled.stride.height",
"diff.stride.duration"
)
) # ggplot data needs to be tall
assign(paste0("g",i),
ggplot(data = df_transformed,
aes(x=time.from.start, y=value, group=variable,
color=variable,shape=variable)) +
geom_line() + theme_bw() +
ggtitle("Data post the application of the median filter") +
theme(legend.position="none",
#axis.text.x=element_text(angle=90,hjust=1),
plot.title = element_text(hjust = 0.5)) +
facet_wrap(~variable,nrow=3, scales = "free_y")
)
assign(paste0("h",i),
ggplot(data = diff_df_transformed,
aes(x=diff.time.from.start, y=value, group=variable,
color=variable,shape=variable)) +
geom_line() + theme_bw() +
ggtitle("First Difference of the Signal post the median filter") +
theme(legend.position="none",
#axis.text.x=element_text(angle=90,hjust=1),
plot.title = element_text(hjust = 0.5)) +
facet_wrap(~variable,nrow=3, scales = "free_y")
)
cat('###',paste0("P", i), "{-}",'\n')
print(get(paste0("g",i)))
cat('\n')
cat(paste0('<source> <p> Based on the <b>median filter</b>, Participant ', i,
' has an average scaled stride length of ',
round(mean(scaled.stride.len),2),
', an average scaled stride height of ',
round(mean(scaled.stride.height),2),
' and an average stride duration of ', round(mean(stride.duration),2),
' seconds. Based on their height of ', subject_heights[i],
' meters, their average stride length and stride height are', ' ',
round(mean(scaled.stride.len)*subject_heights[i],2),
' and ', ' ', round(mean(scaled.stride.height)*subject_heights[i],2),
', respectively.', '</p> </source>'
)
)
cat('\n')
print(get(paste0("h",i)))
cat(paste0('<source> <p> Based on the <b>median filter </b>the interquarile range of the <b>first difference </b> of Participant ', i,
'\'s scaled stride length, scaled stride height and stride duration are',
' ',
round(IQR(diff.scaled.stride.len),3),
', ', round(IQR(diff.scaled.stride.height),3),
', and ', round(IQR(diff.stride.duration),3),
', respectively. </p> </source>'
)
)
cat(' \n \n')
cat('<source> <p> The data processed with the median filtered approach is stored at the <a href="https://github.com/fmegahed/fatigue-changepoint/blob/master/Data/RGenerated/MedianFilteredData.RData">following location</a> in our GitHub Repository.')
cat('\n \n')
}
Based on the median filter, Participant 1 has an average scaled stride length of 1.08, an average scaled stride height of 0.09 and an average stride duration of 1.06 seconds. Based on their height of 1.71 meters, their average stride length and stride height are 1.84 and 0.16, respectively.
Based on the median filter the interquarile range of the first difference of Participant 1’s scaled stride length, scaled stride height and stride duration are 0.001, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 2 has an average scaled stride length of 1.65, an average scaled stride height of 0.13 and an average stride duration of 0.94 seconds. Based on their height of 1.77 meters, their average stride length and stride height are 2.91 and 0.24, respectively.
Based on the median filter the interquarile range of the first difference of Participant 2’s scaled stride length, scaled stride height and stride duration are 0.001, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 3 has an average scaled stride length of 0.94, an average scaled stride height of 0.09 and an average stride duration of 1.08 seconds. Based on their height of 1.71 meters, their average stride length and stride height are 1.61 and 0.15, respectively.
Based on the median filter the interquarile range of the first difference of Participant 3’s scaled stride length, scaled stride height and stride duration are 0.003, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 4 has an average scaled stride length of 1.31, an average scaled stride height of 0.13 and an average stride duration of 0.93 seconds. Based on their height of 1.59 meters, their average stride length and stride height are 2.09 and 0.2, respectively.
Based on the median filter the interquarile range of the first difference of Participant 4’s scaled stride length, scaled stride height and stride duration are 0.002, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 5 has an average scaled stride length of 0.98, an average scaled stride height of 0.09 and an average stride duration of 1.09 seconds. Based on their height of 1.69 meters, their average stride length and stride height are 1.66 and 0.15, respectively.
Based on the median filter the interquarile range of the first difference of Participant 5’s scaled stride length, scaled stride height and stride duration are 0.002, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 6 has an average scaled stride length of 0.92, an average scaled stride height of 0.11 and an average stride duration of 1.05 seconds. Based on their height of 1.63 meters, their average stride length and stride height are 1.5 and 0.17, respectively.
Based on the median filter the interquarile range of the first difference of Participant 6’s scaled stride length, scaled stride height and stride duration are 0.002, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 7 has an average scaled stride length of 0.85, an average scaled stride height of 0.09 and an average stride duration of 1.13 seconds. Based on their height of 1.6 meters, their average stride length and stride height are 1.36 and 0.15, respectively.
Based on the median filter the interquarile range of the first difference of Participant 7’s scaled stride length, scaled stride height and stride duration are 0.001, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 8 has an average scaled stride length of 1.59, an average scaled stride height of 0.14 and an average stride duration of 0.96 seconds. Based on their height of 1.71 meters, their average stride length and stride height are 2.72 and 0.24, respectively.
Based on the median filter the interquarile range of the first difference of Participant 8’s scaled stride length, scaled stride height and stride duration are 0.004, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 9 has an average scaled stride length of 0.87, an average scaled stride height of 0.1 and an average stride duration of 1.11 seconds. Based on their height of 1.67 meters, their average stride length and stride height are 1.46 and 0.17, respectively.
Based on the median filter the interquarile range of the first difference of Participant 9’s scaled stride length, scaled stride height and stride duration are 0.002, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 10 has an average scaled stride length of 1.34, an average scaled stride height of 0.11 and an average stride duration of 0.99 seconds. Based on their height of 1.78 meters, their average stride length and stride height are 2.38 and 0.2, respectively.
Based on the median filter the interquarile range of the first difference of Participant 10’s scaled stride length, scaled stride height and stride duration are 0.003, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 11 has an average scaled stride length of 1.22, an average scaled stride height of 0.09 and an average stride duration of 1.05 seconds. Based on their height of 1.68 meters, their average stride length and stride height are 2.04 and 0.15, respectively.
Based on the median filter the interquarile range of the first difference of Participant 11’s scaled stride length, scaled stride height and stride duration are 0.002, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 12 has an average scaled stride length of 1.64, an average scaled stride height of 0.14 and an average stride duration of 0.88 seconds. Based on their height of 1.55 meters, their average stride length and stride height are 2.55 and 0.22, respectively.
Based on the median filter the interquarile range of the first difference of Participant 12’s scaled stride length, scaled stride height and stride duration are 0.001, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 13 has an average scaled stride length of 0.68, an average scaled stride height of 0.07 and an average stride duration of 1.15 seconds. Based on their height of 1.83 meters, their average stride length and stride height are 1.24 and 0.13, respectively.
Based on the median filter the interquarile range of the first difference of Participant 13’s scaled stride length, scaled stride height and stride duration are 0.002, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 14 has an average scaled stride length of 1.37, an average scaled stride height of 0.14 and an average stride duration of 0.92 seconds. Based on their height of 1.81 meters, their average stride length and stride height are 2.48 and 0.25, respectively.
Based on the median filter the interquarile range of the first difference of Participant 14’s scaled stride length, scaled stride height and stride duration are 0.002, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
Based on the median filter, Participant 15 has an average scaled stride length of 1.22, an average scaled stride height of 0.1 and an average stride duration of 1.03 seconds. Based on their height of 1.89 meters, their average stride length and stride height are 2.3 and 0.19, respectively.
Based on the median filter the interquarile range of the first difference of Participant 15’s scaled stride length, scaled stride height and stride duration are 0.002, 0, and 0, respectively.
The data processed with the median filtered approach is stored at the following location in our GitHub Repository.
# Saving a list of cleaned
save(subject1_medianf, subject2_medianf, subject3_medianf, subject4_medianf,
subject5_medianf, subject6_medianf, subject7_medianf, subject8_medianf,
subject9_medianf, subject10_medianf, subject11_medianf, subject12_medianf,
subject13_medianf, subject14_medianf, subject15_medianf,
subject1_median_diff_f, subject2_median_diff_f, subject3_median_diff_f,
subject4_median_diff_f,
subject5_median_diff_f, subject6_median_diff_f, subject7_median_diff_f,
subject8_median_diff_f,
subject9_median_diff_f, subject10_median_diff_f, subject11_median_diff_f,
subject12_median_diff_f,
subject13_median_diff_f, subject14_median_diff_f, subject15_median_diff_f,
file = "./Data/RGenerated/MedianFilteredData.RData")
To be examined later.
To be filled later.
In this section, we will examine the use of the approach of Matteson and James (2014) for multiple change point analysis of our trivariate data. Our analysis takes advantage of their R package titled ecp, which is described in more details in the following paper: James and Matteson (2014). The multivariate changepoint method will be first tested using the median filtered data and then using the CUSUM of the median filtered data.
To apply the ecp approach of Matterson and James (2014) on the median filtered data, we need to use a window size (which reflects the smallest size of a shift that is to be detected). From a biomechanics perspective, we expect that the change of performance due to fatigue will be sustained at least for several minutes (until the person adapts and/or rests). Thus, the size of the window (\(m_{ecp}\)) should be \(m_{ecp} \ge 60\). In our analysis, we tried the following window sizes: 60, 120, 180, 300, and 600. For most of the participants, the number and/or location of change were almost similar. For demonstration purpose, we use \(m_{ecp} = 180\) below.
load(file = "./Data/RGenerated/MedianFilteredData.RData")
pen <- function(x) -length(x) #Equally penalizes every additional changepoint
window.size <- 180
changepoints_medianf <- list()
for (i in 1:15) {
df <- get(paste0("subject",i,"_medianf"))
# Making the DF divisiable by the window size
# (i.e., we are removing the remainder so that our chunks are equally sized)
rows.to.keep <- nrow(df[,2:4]) %/% window.size
remainder <- nrow(df[,2:4]) %% window.size
df.MultivariateChangepoint<- df[1:(nrow(df)-remainder),
2:4] %>% as.matrix()
# Utilizing the e.agglo() from the ecp package
mem <- rep(1:rows.to.keep, each=window.size)
y = e.agglo(X=df.MultivariateChangepoint,
member= mem,
alpha = 1,
penalty = pen)
changepoints_medianf[[i]] <- y$estimates
cat('####',paste0("P", i), "{-}",'\n')
df_transformed <- melt(get(paste0("subject",i,"_medianf")),
id.vars = "time.from.start",
measure.vars=c("scaled.stride.len",
"scaled.stride.height",
"stride.duration"
)
) # ggplot data needs to be tall
assign(paste0("g",i),
ggplot(data = df_transformed,
aes(x=time.from.start, y=value, group=variable,
color=variable,shape=variable)) +
geom_line() + theme_bw() +
ggtitle("Changepoints for the Median Filtered Data (at vertical black lines)") +
theme(legend.position="none",
#axis.text.x=element_text(angle=90,hjust=1),
plot.title = element_text(hjust = 0.5)) +
facet_wrap(~variable,nrow=3, scales = "free_y") +
geom_vline(xintercept= y$estimates)
)
print(get(paste0("g",i)))
cat('\n')
cat(paste0('<source> <p> Based on the <b>ecp package</b>, the number of changepoints for participant ', i,
' is equal to: ',
length(changepoints_medianf[[i]]),
'. These changepoints are located at: ',
paste(changepoints_medianf[[i]], collapse = ", "),
'. The results from the analysis above can be found at: <a href="https://github.com/fmegahed/fatigue-changepoint/blob/master/Data/RGenerated/ECPChangePointsMFData.RData">ECPChangePointsMFData.RData</a>. </p> </source>')
)
cat('\n \n')
}
Based on the ecp package, the number of changepoints for participant 1 is equal to: 2. These changepoints are located at: 3061, 3961. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 2 is equal to: 2. These changepoints are located at: 181, 3781. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 3 is equal to: 2. These changepoints are located at: 901, 1261. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 4 is equal to: 4. These changepoints are located at: 1, 1261, 5221, 7741. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 5 is equal to: 3. These changepoints are located at: 1, 1801, 3601. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 6 is equal to: 2. These changepoints are located at: 3241, 5401. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 7 is equal to: 2. These changepoints are located at: 1, 3421. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 8 is equal to: 2. These changepoints are located at: 1081, 3601. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 9 is equal to: 3. These changepoints are located at: 1, 4681, 5581. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 10 is equal to: 3. These changepoints are located at: 1, 2161, 6661. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 11 is equal to: 3. These changepoints are located at: 1, 1621, 7381. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 12 is equal to: 2. These changepoints are located at: 2161, 2341. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 13 is equal to: 3. These changepoints are located at: 1, 361, 6301. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 14 is equal to: 3. These changepoints are located at: 1801, 1981, 4861. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
Based on the ecp package, the number of changepoints for participant 15 is equal to: 2. These changepoints are located at: 4321, 6121. The results from the analysis above can be found at: ECPChangePointsMFData.RData.
save(changepoints_medianf,
file = "./Data/RGenerated/ECPChangePointsMFData.RData")
Here, we apply the ecp approach of Matterson and James (2014) on the median filtered differenced data (as opposed to the median filtered timeseries). For demonstration purpose, we use \(m_{ecp} = 180\) below.
load(file = "./Data/RGenerated/MedianFilteredData.RData")
pen <- function(x) -length(x) #Equally penalizes every additional changepoint
window.size <- 180
changepoints_medianf_diff <- list()
for (i in 1:15) {
df <- get(paste0("subject",i,"_median_diff_f"))
# Making the DF divisiable by the window size
# (i.e., we are removing the remainder so that our chunks are equally sized)
rows.to.keep <- nrow(df[,2:4]) %/% window.size
remainder <- nrow(df[,2:4]) %% window.size
df.MultivariateChangepoint<- df[1:(nrow(df)-remainder),
2:4] %>% as.matrix()
# Utilizing the e.agglo() from the ecp package
mem <- rep(1:rows.to.keep, each=window.size)
y = e.agglo(X=df.MultivariateChangepoint,
member= mem,
alpha = 1,
penalty = pen)
changepoints_medianf_diff[[i]] <- y$estimates
cat('####',paste0("P", i), "{-}",'\n')
df_transformed <- melt(get(paste0("subject",i,"_median_diff_f")),
id.vars = "diff.time.from.start",
measure.vars=c("diff.scaled.stride.len",
"diff.scaled.stride.height",
"diff.stride.duration"
)
) # ggplot data needs to be tall
assign(paste0("g",i),
ggplot(data = df_transformed,
aes(x=diff.time.from.start, y=value, group=variable,
color=variable,shape=variable)) +
geom_line() + theme_bw() +
ggtitle("Changepoints for the Median Filtered Differenced Data (at vertical black lines)") +
theme(legend.position="none",
#axis.text.x=element_text(angle=90,hjust=1),
plot.title = element_text(hjust = 0.5)) +
facet_wrap(~variable,nrow=3, scales = "free_y") +
geom_vline(xintercept= y$estimates)
)
print(get(paste0("g",i)))
cat('\n')
cat(paste0('<source> <p> Based on the <b>ecp package</b>, the number of changepoints for the median filtered differenced data for participant ', i,
' is equal to: ',
length(changepoints_medianf_diff[[i]]),
'. These changepoints are located at: ',
paste(changepoints_medianf_diff[[i]], collapse = ", "),
'. The results from the analysis above can be found at: <a href="https://github.com/fmegahed/fatigue-changepoint/blob/master/Data/RGenerated/ECPChangePointsDiffMFData.RData">ECPChangePointsDiffMFData.RData</a>. </p> </source>')
)
cat('\n \n')
}
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 1 is equal to: 1. These changepoints are located at: 4501. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 2 is equal to: 1. These changepoints are located at: 541. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 3 is equal to: 1. These changepoints are located at: 1261. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 4 is equal to: 1. These changepoints are located at: 541. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 5 is equal to: 1. These changepoints are located at: 181. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 6 is equal to: 1. These changepoints are located at: 3061. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 7 is equal to: 1. These changepoints are located at: 1441. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 8 is equal to: 1. These changepoints are located at: 3961. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 9 is equal to: 1. These changepoints are located at: 2341. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 10 is equal to: 1. These changepoints are located at: 6481. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 11 is equal to: 1. These changepoints are located at: 4681. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 12 is equal to: 1. These changepoints are located at: 2341. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 13 is equal to: 1. These changepoints are located at: 5941. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 14 is equal to: 1. These changepoints are located at: 1981. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
Based on the ecp package, the number of changepoints for the median filtered differenced data for participant 15 is equal to: 1. These changepoints are located at: 6121. The results from the analysis above can be found at: ECPChangePointsDiffMFData.RData.
save(changepoints_medianf_diff,
file = "./Data/RGenerated/ECPChangePointsDiffMFData.RData")
Department of Mechanical and Aerospace Engineering, University at Buffalo↩
Department of Industrial and Systems Engineering, University at Buffalo↩
Farmer School of Business, Miami University↩
Farmer School of Business, Miami University. This author can be reached by email at fmegahed@miamioh.edu.↩